class: center, middle, title-slide

# Basic Data Wrangling and Data Visualization in R

### Luke Tierney

### University of Iowa

### 21 June, 2021

---
class: center, middle

<style type="text/css">
.content-box-blue { background-color: lightblue; }
.small-font { font-size: 70%; }
.note {
  padding: 15px;
  margin-bottom: 20px;
  border: 1px solid transparent;
  border-radius: 4px;
  background-color: #d9edf7;
  border-color: #bce8f1;
  color: #31708f;
}
</style>

# Introduction

---

## Outline

In this class I will

--

* Briefly outline the history of R.

--

* Use some examples to briefly show how to wrangle and visualize data in R.

--

Materials for this class are available on GitHub at
<https://github.com/ltierney/SIBS-WV-2021.git>.

--

* You can access the repository as an RStudio project by following the menu
  selection **File > New Project > Version Control > Git** and specifying
  this URL.

--

* You can use the `git` command line client with

```shell
git clone https://github.com/ltierney/SIBS-WV-2021.git
```

--

Materials for our _Data Visualization and Data Technologies_ course are
available at
<http://www.stat.uiowa.edu/~luke/classes/STAT4580-2021/>

---

## Tools

Some tools I will be using:

--

* The [RStudio](https://www.rstudio.com) IDE.

--

* Many features from the basic [R](https://www.r-project.org) distribution.

--

* Some tools from the [_tidyverse_](https://www.tidyverse.org/).

--

* The [`ggplot`](https://ggplot2.tidyverse.org/) package based on the
  _Grammar of Graphics_ framework.

--

Most of these packages are loaded when you load the `tidyverse` package:

```r
library(tidyverse)
```

---

## References

Useful references:

--

> Hadley Wickham and Garrett Grolemund (2016), [_R for Data
> Science_](http://r4ds.had.co.nz/), O'Reilly.

--

> Claus O. Wilke (2019), [_Fundamentals of Data
> Visualization_](https://serialmentor.com/dataviz/), O'Reilly.

--

> Kieran Healy (2018), [_Data Visualization: A Practical
> Introduction_](http://socviz.co/), Princeton University Press.

--

> Rafael A.
Irizarry (2019), [Introduction to Data Science: _Data Analysis and
Prediction Algorithms with R_](https://rafalab.github.io/dsbook/),
Chapman & Hall/CRC.
([Book source on GitHub](https://github.com/rafalab/dsbook))

--

There is also a small interactive [tutorial](../tutorial/penguins.Rmd)
available.

--

**Ask questions any time!**

---
class: center, middle

# The R Language

---

## Background

R is a language for data analysis and graphics.

--

* R was originally developed by Robert Gentleman and Ross Ihaka in the early
  1990s for a Macintosh computer lab at U. of Auckland, New Zealand.

--

* R is based on the S language developed by John Chambers and others at Bell
  Labs.

--

R is an Open Source project.

--

* Since 1997 R has been developed and maintained by the R-core group, with
  around 20 members located in more than 10 different countries.

--

* R is widely used in the field of statistics and beyond, especially in
  university environments.

--

* R has become the primary framework for developing and making available new
  statistical methodology.

--

* Many (now over 17,000) extension packages are available through CRAN or
  similar repositories.

---

## Working with R

R is designed for interactive data exploration.

--

* Interaction is through a _read-eval-print loop (REPL)_.

--

* This is also called a _command line interface (CLI)_.

--

All computations are specified in the R language.

--

* Even for simple tasks you need to know a little of the language.

--

* After learning to do simple tasks you know some of the language.

--

The language is used to

--

* prepare data for analysis;

--

* specify individual analyses;

--

* program repeated or similar analyses;

--

* program new methods of analysis.

--

Specifying these tasks in a language supports _reproducible research_.

---

## Working with R

The R language operates on vectors and arrays.

--

Commonly used data types are:

--

* integer and numeric vectors;

--

* logical vectors;

--

* character vectors;

--

* factors.
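--

A minimal base R sketch of these types (the variable names are just for
illustration):

```r
x <- c(1.5, 2, 2.5)                 # a numeric vector
n <- 1:3                            # an integer vector
b <- x > 2                          # a logical vector: FALSE FALSE TRUE
s <- c("short", "long", "long")     # a character vector
f <- factor(s, levels = c("short", "long"))  # a factor with two levels
x + n                               # element-wise arithmetic: 2.5 4.0 5.5
```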
-- All basic vector types support missing (`NA`) values. -- Arithmetic operations are vectorized to operate element-wise on vectors. -- Data vectors are usually combined into table-like objects called _data frames_. --- ## The Data Analysis Process A figure that shows the steps usually involved in a data analysis project: <center>
</center>

--

These steps are often repeated many times, so it is important to make your
work reproducible.

---

## Reproducible Data Analysis

Making your work reproducible:

--

* Save your work in a text file or notebook.

--

* Track changes to your files with a version control system like
  [`git`](https://git-scm.com/).

--

* Use a system like [Rmarkdown](https://rmarkdown.rstudio.com) to prepare
  your reports.

--

This allows you to re-create your report when the data changes (as it often
will!).

--

A good resource for setting up your tools to support this is
[_Happy Git and GitHub for the useR_](https://happygitwithr.com/).

---
class: center, middle

# Some Examples

---

## Example Data Sets

When working with research data, a first step is usually to read and clean
the data.

--

We'll put that off for a little while and work with some data sets made
available in R packages.

--

Data sets available in R packages include:

--

* many classic data sets;

--

* newer, often larger, data sets useful for learning;

--

* current data obtained by querying web APIs.

---

## Old Faithful Eruptions

A simple classic data set is the `geyser` data frame available in package
`MASS`.

--

.pull-left[
```r
data(geyser, package = "MASS")
dim(geyser)
## [1] 299   2
head(geyser, 4)
##   waiting duration
## 1      80 4.016667
## 2      71 2.150000
## 3      57 4.000000
## 4      80 4.000000
```
]

--

.pull-right[
<div class = "note">
`head` and `tail` return the first and last few rows of a data frame. They
are useful for quick sanity checks.
</div>
<!-- foobar - don't ask -->
]

--

The rows represent measurements recorded for eruptions of the
[_Old Faithful_](https://www.yellowstonepark.com/things-to-do/geysers-hot-springs/about-old-faithful/)
geyser in Yellowstone National Park, Wyoming.

--

The variables are:

* `waiting`: the time in minutes since the previous eruption;

--

* `duration`: the duration in minutes of the eruption.
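--

A quick base R look at the two variables (just a sanity check; `summary`
gives five-number summaries plus the mean):

```r
data(geyser, package = "MASS")   # load the data set from MASS
summary(geyser$waiting)          # gap times, in minutes
range(geyser$duration)           # shortest and longest recorded durations
```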
---

## Old Faithful Eruptions

The durations have a bimodal distribution:

.pull-left[
<!-- -->
]

--

.pull-right[
```r
ggplot(geyser) +
  geom_histogram(aes(x = duration),
                 bins = 15,
                 color = "black",
                 fill = "grey")
```
{{content}}
]

--

This is [`ggplot`](https://ggplot2.tidyverse.org/) code for creating a
histogram.

<!-- # nolint start -->
{{content}}

--

<div class = "note">
A basic template for creating a plot with `ggplot`:

```r
ggplot(data = <DATA>) + <GEOM>(mapping = aes(<MAPPINGS>))
```
</div>
<!-- # nolint end -->

---

## Old Faithful Eruptions

An interesting question is whether the duration can be used to predict when
the _next_ eruption will occur.

--

A plot of the _previous_ duration against the waiting time to the current
eruption:

--

.pull-left[
<!-- -->
]

--

.pull-right[
```r
ggplot(geyser) +
  geom_point(aes(x = lag(duration),
                 y = waiting))
```
{{content}}
]

--

It looks like a useful rule would be to expect a shorter waiting time after a
shorter eruption duration.

---

## Old Faithful Eruptions

An interesting feature:

--

Many durations are recorded as 2 or 4 minutes.

--

This can also be seen in a histogram with a small bin width:

.pull-left[
<!-- -->
]

--

.pull-right[
```r
p <- ggplot(geyser) +
  geom_histogram(aes(x = duration,
                     y = stat(density)),
                 fill = "grey",
                 color = "black",
*                binwidth = 0.1)
p
```
{{content}}
]

--

<div class = "note">
`ggplot` produces a plot object. Drawing only happens when the object is
printed.
</div>
<!-- foobar - don't ask -->

---

## Old Faithful Eruptions

Does this rounding matter?

--

* For many analyses it probably doesn't.

--

* It might if you wanted to fit normal distributions to the two groups.
--

.pull-left[
Taking 3 minutes as the divide, we can first pick out the short and long
durations:

```r
d <- geyser$duration
d_short <- d[d < 3]
d_long <- d[d >= 3]
```
]

--

.pull-right[
Then compute the means and standard deviations as

```r
mean(d_short)
## [1] 1.980317
sd(d_short)
## [1] 0.2779829
mean(d_long)
## [1] 4.262113
sd(d_long)
## [1] 0.3937525
mean(d >= 3)
## [1] 0.6488294
```
]

---

## Old Faithful Eruptions

An approach that scales better:

--

Compute group summaries using tools from the `dplyr` tidyverse package.

--

First, add a `type` variable:

```r
geyser <- mutate(geyser,
                 type = ifelse(duration < 3, "short", "long"))
```

--

The summaries can then be computed as

.pull-left[
```r
sgd <- summarize(group_by(geyser, type),
                 mean = mean(duration),
                 sd = sd(duration),
                 n = n())
(sgd <- mutate(sgd, prop = n / sum(n)))
## # A tibble: 2 x 5
##   type   mean    sd     n  prop
##   <chr> <dbl> <dbl> <int> <dbl>
## 1 long   4.26 0.394   194 0.649
## 2 short  1.98 0.278   105 0.351
```
]

--

.pull-right[
<div class = "note">
The functions `summarize`, `group_by`, and `mutate` are from the `dplyr`
package that implements a _grammar of data manipulation_.
</div>
<!-- foobar - don't ask -->
]

---

## Old Faithful Eruptions

This computation can also be written using the _forward pipe operator_ `%>%`:

--

```r
sgd <- group_by(geyser, type) %>%
    summarize(mean = mean(duration),
              sd = sd(duration),
              n = n()) %>%
    ungroup() %>%
    mutate(prop = n / sum(n))
sgd
## # A tibble: 2 x 5
##   type   mean    sd     n  prop
##   <chr> <dbl> <dbl> <int> <dbl>
## 1 long   4.26 0.394   194 0.649
## 2 short  1.98 0.278   105 0.351
```

--

<div class = "note">
The pipe operator allows a sequence of operations to be chained together.

--

The result of the left-hand operation is passed implicitly as the first
argument to the function called on the right.
</div> <!-- foobar - don't ask --> --- ## Old Faithful Eruptions One way to show the superimposed normal densities: .pull-left[ <!-- --> ] -- .pull-right[ ```r f1 <- function(x) sgd$prop[1] * dnorm(x, sgd$mean[1], sgd$sd[1]) f2 <- function(x) sgd$prop[2] * dnorm(x, sgd$mean[2], sgd$sd[2]) p <- p + stat_function(color = "red", fun = f1) + stat_function(color = "blue", fun = f2) p ``` {{content}} ] -- <div class = "note"> A `ggplot` can consist of several _layers_. </div> <!-- foobar - don't ask --> --- ## Old Faithful Eruptions The means and standard deviations are affected by the rounding. -- Summaries that omit values equal to 2 or 4 minutes can be computed as ```r geyser2 <- filter(geyser, duration != 2, duration != 4) sgd2 <- group_by(geyser2, type) %>% summarize(mean = mean(duration), sd = sd(duration), n = n()) %>% ungroup() %>% mutate(prop = n / sum(n)) sgd2 ## # A tibble: 2 x 5 ## type mean sd n prop ## <chr> <dbl> <dbl> <int> <dbl> ## 1 long 4.36 0.422 141 0.632 ## 2 short 1.97 0.315 82 0.368 ``` --- ## Old Faithful Eruptions A plot showing curves computed both ways: -- .pull-left[ <!-- --> ] -- .pull-right[ ```r f1_2 <- function(x) sgd2$prop[1] * dnorm(x, sgd2$mean[1], sgd2$sd[1]) f2_2 <- function(x) sgd2$prop[2] * dnorm(x, sgd2$mean[2], sgd2$sd[2]) p <- p + stat_function(color = "red", linetype = 2, fun = f1_2) + stat_function(color = "blue", linetype = 2, fun = f2_2) p ``` ] --- ## Minnesota Barley Yields Another classic data set: -- Total yield in bushels per acre for 10 varieties at 6 sites in Minnesota in each of two years, 1931 and 1932. 
-- The raw data: ```r data(barley, package = "lattice") head(barley) ## yield variety year site ## 1 27.00000 Manchuria 1931 University Farm ## 2 48.86667 Manchuria 1931 Waseca ## 3 27.43334 Manchuria 1931 Morris ## 4 39.93333 Manchuria 1931 Crookston ## 5 32.96667 Manchuria 1931 Grand Rapids ## 6 28.96667 Manchuria 1931 Duluth ``` --- ## Minnesota Barley Yields Some initial plots: ```r p1 <- ggplot(barley) + geom_point(aes(x = yield, y = variety)) p2 <- ggplot(barley) + geom_point(aes(x = yield, y = site)) cowplot::plot_grid(p1, p2) ``` <!-- --> --- ## Minnesota Barley Yields Using color to separate yields in the two years: ```r p1 <- ggplot(barley) + geom_point(aes(x = yield, y = variety, color = year)) p2 <- ggplot(barley) + geom_point(aes(x = yield, y = site, color = year)) cowplot::plot_grid(p1, p2) ``` <!-- --> --- ## Minnesota Barley Yields Can we also show `site` using symbol shape? -- .pull-left[ <!-- --> ] -- .pull-right[ ```r ggplot(barley) + geom_point(aes(x = yield, y = variety, color = year, * shape = site)) ``` {{content}} ] -- There is a lot of _interference_ between shape and color. --- ## Minnesota Barley Yields Larger points may help: .pull-left[ <!-- --> ] .pull-right[ ```r ggplot(barley) + geom_point(aes(x = yield, y = variety, color = year, shape = site), * size = 2.5) ``` ] --- ## Minnesota Barley Yields _Jittering_ may also help: .pull-left[ <!-- --> ] .pull-right[ ```r ggplot(barley) + geom_point(aes(x = yield, y = variety, color = year, shape = site), size = 2.5, * position = * position_jitter( * height = 0.15, * width = 0)) ``` ] --- ## Minnesota Barley Yields .pull-left[ Another approach: _faceting_ to produce _small multiples_. ```r ggplot(barley) + geom_point(aes(x = yield, y = variety, color = year)) + * facet_wrap(~site, ncol = 2) ``` ] .pull-right[ <!-- --> ] --- ## Minnesota Barley Yields Focusing on summaries can help. 
--

A _dot plot_ of average yields for each site and year:

.pull-left[
<!-- -->
]

--

.pull-right[
```r
barley_site_year <-
    group_by(barley, site, year) %>%
    summarize(yield = mean(yield)) %>%
    ungroup()
ggplot(barley_site_year) +
    geom_point(aes(y = site,
                   x = yield,
                   color = year),
               size = 3)
```
]

---

## Minnesota Barley Yields

Adding lines can help in comparing the changes. This is sometimes called a
_dumbbell chart_:

.pull-left[
<!-- -->
]

.pull-right[
```r
barley_site_year <-
    group_by(barley, site, year) %>%
    summarize(yield = mean(yield)) %>%
    ungroup()
ggplot(barley_site_year) +
*   geom_line(aes(y = site,
*                 x = yield,
*                 group = site),
*             color = "darkgrey",
*             size = 2) +
    geom_point(aes(y = site,
                   x = yield,
                   color = year),
*              size = 4)
```
]

---

## Minnesota Barley Yields

Another useful approach for showing repeated measurements is a _slope graph_:

.pull-left[
.hide-code[
```r
library(ggrepel)
barley_site_year <-
    mutate(barley_site_year, year = fct_rev(year))
barley_site_year_1932 <-
    filter(barley_site_year, year == "1932")
ggplot(barley_site_year,
       aes(x = year, y = yield, group = site)) +
    geom_line() +
    geom_text_repel(aes(label = site),
                    data = barley_site_year_1932,
                    hjust = "left",
                    direction = "y") +
    scale_x_discrete(expand = expansion(mult = c(0.1, .25)),
                     position = "top") +
    labs(x = NULL, y = "Average Yield")
```

<!-- -->
]
]

--

.pull-right[
This emphasizes the reversal for Morris.
]

---

## Minnesota Barley Yields

_Bar charts_ are sometimes used for summaries, but dot plots are usually a
better choice.

.pull-left[
<!-- -->
]

.pull-right[
```r
ggplot(barley_site_year) +
    geom_col(aes(x = yield,
                 y = site,
                 fill = year),
             size = 3,
             position = "dodge",
             width = .4)
```
]

---

## Bar Charts and the Zero Base Line

Because of the way we perceive bars, it is important to use a
[zero base line for bar charts](https://flowingdata.com/2015/08/31/bar-chart-baselines-start-at-zero/).
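--

As an illustration (a sketch with made-up numbers): `geom_col` includes the
zero base line by default, and zooming in with `coord_cartesian` shows how a
truncated axis exaggerates a small difference:

```r
library(ggplot2)
df <- data.frame(group = c("A", "B"), value = c(98, 100))  # hypothetical data
p <- ggplot(df) + geom_col(aes(x = group, y = value))
p                                        # bars start at zero
p + coord_cartesian(ylim = c(95, 100))   # truncated axis makes B look much larger
```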
--  --  --- ## Hair and Eye Color Data A data set recording the distribution of hair and eye color and sex in 592 statistics students. .pull-left[ ```r HairEyeDF <- as.data.frame(HairEyeColor) head(HairEyeDF) ## Hair Eye Sex Freq ## 1 Black Brown Male 32 ## 2 Brown Brown Male 53 ## 3 Red Brown Male 10 ## 4 Blond Brown Male 3 ## 5 Black Blue Male 11 ## 6 Brown Blue Male 50 ``` ] -- .pull-right[ The data set is available as a _cross-tabulation_. {{content}} ] -- `as.data.frame` converts it to a data frame. --- ## Hair and Eye Color Data Looking at the distribution of eye color: -- .pull-left[ <!-- --> ] -- .pull-right[ ```r eye <- group_by(HairEyeDF, Eye) %>% summarize(Freq = sum(Freq)) %>% ungroup() ggplot(eye) + geom_col(aes(x = Eye, y = Freq), position = "dodge") ``` ] --- ## Hair and Eye Color Data Mapping eye color to bar color in addition to the horizontal axis position can help: .pull-left[ <!-- --> ] .pull-right[ ```r eye <- group_by(HairEyeDF, Eye) %>% summarize(Freq = sum(Freq)) %>% ungroup() ggplot(eye) + geom_col(aes(x = Eye, y = Freq, * fill = Eye), position = "dodge") ``` ] --- ## Hair and Eye Color Data More sensible colors would be nice but require a bit of work: .pull-left[ <!-- --> ] .pull-right[ ```r hazel_rgb <- col2rgb("brown") * 0.75 + col2rgb("green") * 0.25 hazel <- do.call(rgb, as.list(hazel_rgb / 255)) cols <- c(Blue = colorspace::lighten(colorspace::desaturate("blue", 0.3), 0.3), Green = colorspace::lighten("forestgreen", 0.1), Brown = colorspace::lighten("brown", 0.0001), ## 0.3? 
Hazel = colorspace::lighten(hazel, 0.3)) pb <- ggplot(eye) + geom_col(aes(x = Eye, y = Freq, fill = Eye), position = "dodge") + scale_fill_manual(values = cols) pb ``` ] --- ## Hair and Eye Color Data A _stacked bar chart_ can also be useful: .pull-left[ <!-- --> ] .pull-right[ ```r psb <- ggplot(eye) + geom_col(aes(x = "", y = Freq, fill = Eye), color = "lightgrey") + scale_fill_manual(values = cols) psb ``` ] --- ## Hair and Eye Color Data Changing to polar coordinates produces a _pie chart_: .pull-left[ <!-- --> ] .pull-right[ ```r (pp <- psb + coord_polar("y")) ``` ] --- ## Hair and Eye Color Data The axis and grid are not helpful; a _theme_ adjustment can remove them: .pull-left[ <!-- --> ] .pull-right[ ```r (pp <- pp + theme_void()) ``` {{content}} ] -- <div class = "note"> Themes provide a way to customize the non-data components of plots: i.e. titles, labels, fonts, background, grid lines, and legends. {{content}} -- Themes can be used to give plots a consistent customized look. {{content}} -- The `ggthemes` package provides a number of themes to emulate the style of different publications, for example `theme_wsj` and `theme_economist`. </div> <!-- foobar - don't ask --> --- ## Hair and Eye Color Data How well do bar charts and pie charts work? -- .pull-left[ <!-- --> ] -- .pull-right[ Some questions: {{content}} ] -- * Which plot makes it easier to tell whether the proportion of brown-eyed students is larger or smaller than the proportion of blue-eyed students? {{content}} -- * Which plot makes it easier to tell whether these proportions are larger or smaller than 1/2 or 1/4 or 1/3? 
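---

## Hair and Eye Color Data

The proportions behind these questions can also be computed directly; a
quick check in base R, starting from the original cross-tabulation:

```r
eye_freq <- margin.table(HairEyeColor, margin = 2)  # sum over hair and sex
round(prop.table(eye_freq), 2)                      # proportion of each eye color
```

Reading the proportions off a table is easy; judging them from bar lengths
or pie angles is what the questions above are probing.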
---

## Hair and Eye Color Data

Looking at the proportions within hair color and sex:

.hide-code[
```r
eye_hairsex <- group_by(HairEyeDF, Hair, Sex) %>%
    mutate(Prop = Freq / sum(Freq)) %>%
    ungroup()
p1 <- ggplot(eye_hairsex) +
    geom_col(aes(x = Eye, y = Prop, fill = Eye)) +
    scale_fill_manual(values = cols) +
    facet_grid(Hair ~ Sex)
p2 <- ggplot(eye_hairsex) +
    geom_col(aes(x = "", y = Prop, fill = Eye)) +
    scale_fill_manual(values = cols) +
    coord_polar("y") +
    facet_grid(Hair ~ Sex) +
    theme_void()
cowplot::plot_grid(p1, p2)
```

<!-- -->
]

---

## A more complete `ggplot` template

<!-- # nolint start -->
```r
ggplot(data = <DATA>) +
  <GEOM>(mapping = aes(<MAPPINGS>),
         stat = <STAT>,
         position = <POSITION>) +
  < ... MORE GEOMS ... > +
  <COORDINATE_ADJUSTMENT> +
  <SCALE_ADJUSTMENT> +
  <FACETING> +
  <THEME_ADJUSTMENT>
```
<!-- # nolint end -->

---
class: center, middle

# Visual Perception and the Grammar of Graphics

---
layout: true

## Monthly River Flows

Monthly flow volumes recorded for a river in the Pacific Northwest.

---

An initial plot using default settings:

.hide-code[
```r
library(ggplot2)
river <- scan(here::here("data/river.dat"))
rd <- data.frame(flow = river, month = seq_along(river))
(pp <- ggplot(rd) + geom_point(aes(x = month, y = flow)))
```

<!-- -->
]

---

Changing the _aspect ratio_:

.hide-code[
```r
pp + coord_fixed(3.5)
```

<!-- -->
]

---

Time series are often visualized with a line plot:

.hide-code[
```r
pl <- ggplot(rd) + geom_line(aes(x = month, y = flow))
pl + coord_fixed(3.5)
```

<!-- -->
]

---

The seasonal variation can be seen with a line plot in the original aspect
ratio:

.hide-code[
```r
pl
```

<!-- -->
]

---
layout: false

## A Simple Model of Visual Perception

The eyes acquire an image, which is processed through three stages of memory:

--

* Iconic memory

--

* Working memory, or short-term memory

--

* Long-term memory

--

The first processing stage of an image happens in iconic memory.

--

* Images remain in iconic memory for less than a second.
--

* Processing in iconic memory is massively parallel and automatic.

--

* This is called _preattentive processing_.

--

Preattentive processing is a fast recognition process.

---

## A Simple Model of Visual Perception

Meaningful visual chunks are moved from iconic memory to short-term memory.

--

* These chunks are used by conscious, or attentive, processing.

--

* Attentive processing often involves conscious comparisons or search.

--

* Short-term memory is limited;

  * information is retained for only a few seconds;

  * only three or four chunks can be held at a time.

--

Long-term visual memory is built up over a lifetime, though infrequently
used visual chunks may become lost.

--

<div class = "note">
**Visual Design Implications**

Try to make as much use of preattentive features as possible.

--

Recognize when preattentive features might mislead.

--

For features that require attentive processing, keep in mind that working
memory is limited.
</div>
<!-- foobar - don't ask -->

---

## Some Terms for Describing Visualizations

Data to be visualized contains _variables_ or _attributes_ measured on
individual _items_ or _cases_.

--

_Links_ are relationships that may exist among items, e.g. months within a
year or countries within a continent.

--

_Marks_ are individual geometric entities used to represent items: points,
bars, etc.

--

_Aesthetics_ or _visual channels_ are the visual features of marks that can
be used to encode attributes.

--

The `aes(...)` expressions establish the mapping between attributes and
visual channels.

--

These ideas closely mirror the structure of the _grammar of graphics_ as
implemented in `ggplot`.

--

> Munzner, T. (2014), [_Visualization Analysis and
> Design_](http://www.cs.ubc.ca/~tmm/vadbook/), CRC Press.

> Wilkinson, L. (2005), _The Grammar of Graphics_, 2nd ed, Springer.

---

## Channels and their Accuracy

A useful distinction among channels:

--

* _Magnitude channels_ can reflect order and numeric values, e.g.
  position on an axis, length, area, brightness.

--

* _Identity channels_ can distinguish different values but not reflect
  order, e.g. hue, shape, grouping.

--

Some channels are better at conveying information than others.

---

## Channels and their Accuracy

Munzner's ordering by accuracy:

--

.small-font[
| Magnitude Channels (Ordered, Numerical) | Identity Channels (Categorical) |
|-----------------------------------------|---------------------------------|
| Position on common scale                | Spatial grouping                |
| Position on unaligned scale             | Color hue                       |
| Length (1D size)                        | Shape                           |
| Tilt, angle                             |                                 |
| Area (2D size)                          |                                 |
| Depth (3D position)                     |                                 |
| Color luminance, saturation             |                                 |
| Curvature, volume (3D size)             |                                 |
]

--

Line width is another channel; not sure there is agreement on its accuracy,
but it is not high.

--

<div class = "note">
**Visual Design Implications**

Try to map the most important variables to the strongest channels.
</div>
<!-- foobar - don't ask -->

---

## Color

Color is very effective when used well.

--

But using color well is not easy.

--

Some of the issues:

--

* Perception depends on context.

--

* Simple color assignments may not separate equally well.

--

* Effectiveness may vary with the medium (screen, projector, print).

--

* Some people do not perceive the full spectrum of colors.

--

* Colors may not survive grey scale printing.

--

* Some colors have cultural significance.

--

* Cultural significance may vary among cultures and with time.

---

## Color

Color perception is relative:

--



--



--

Groups of colors that work well together are called _palettes_.

--

Some tools for selecting palettes include:

--

* [ColorBrewer](http://colorbrewer2.org); available in the `RColorBrewer`
  package.

--

* [HCL Wizard](http://www.hclwizard.org/); also available as `hclwizard` in
  the `colorspace` package.

--

A note on
[rainbow colors](https://eeecon.uibk.ac.at/~zeileis/news/endrainbow/).
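---

## Color

A quick sketch of generating palettes with these tools (assuming the
`RColorBrewer` and `colorspace` packages are installed; the palette names
are just examples):

```r
RColorBrewer::brewer.pal(4, "Set2")                 # a qualitative ColorBrewer palette
colorspace::qualitative_hcl(4, palette = "Dark 3")  # a qualitative HCL palette
```

Either vector of hex colors can be passed to `scale_fill_manual` or
`scale_color_manual` in a `ggplot`.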
---
class: center, middle

# A Grammar of Data Manipulation

---

## The `dplyr` Package

The `dplyr` package provides a language, or grammar, for data manipulation.

--

The design of `dplyr` is strongly motivated by SQL.

--

The language contains a number of _verbs_ that operate on tables.

--

The most commonly used verbs operate on a single data frame:

--

* `select`: pick variables by their names

--

* `filter`: choose rows that satisfy some criteria

--

* `mutate`: create transformed or derived variables

--

* `arrange`: reorder the rows

--

* `summarize`: collapse rows down to summaries

---

## The `dplyr` Package

There are also a number of `join` verbs that merge several data frames into
one.

--

Package `tidyr` provides more verbs, such as `pivot_longer` and
`pivot_wider` for reshaping data frames.

--

The single table verbs can also be used with `group_by` to work separately
on groups of rows.

---
class: center, middle

# More Examples

---

## Raw Data

These examples start with raw data as you might receive it from a
researcher, and involve reading and cleaning the data.

--

Common data formats you might encounter include:

--

* [_CSV_ (comma-separated values)](https://en.wikipedia.org/wiki/Comma-separated_values)
  files.

--

* Text files using other delimiters, such as tabs or `|` characters.

--

* [JSON (JavaScript Object Notation)](https://en.wikipedia.org/wiki/JSON)
  files.

--

* [XML (Extensible Markup Language)](https://en.wikipedia.org/wiki/XML)
  files.

--

* Excel spreadsheets.

--

Tools are available for reading data in these formats into R.

---

## Wind Turbines in Iowa

There are many wind turbines in Iowa.

--

Data is available from the
[U.S. Wind Turbine Database](https://eerscmap.usgs.gov/uswtdb/).

--

A snapshot is available [here](../data/us_wind.csv) as a CSV file.

--

* CSV files are a common form of data exchange.

--

* They are simple text files that are intended to be written and read by a
  computer.
--

* Some CSV files include a header and a footer that need to be handled.

--

* One issue is that a comma isn't a good separator in countries where it is
  the decimal separator!

--

* A CSV file can be read using `read.csv` or `readr::read_csv`.

--

Reading the wind turbine data:

```r
wind_turbines <- read.csv(here::here("data/us_wind.csv"), comment = "#")
```

---

## Wind Turbines in Iowa

Some data cleaning is needed.

--

Focus on the wind turbines in Iowa (19 is the state component of the
[FIPS county code](https://en.wikipedia.org/wiki/FIPS_county_code) for
Iowa):

```r
wt_IA <- filter(wind_turbines, t_fips %/% 1000 == 19)
```

--

Drop entries with missing longitude or latitude values:

```r
wt_IA <- filter(wt_IA, ! is.na(xlong), ! is.na(ylat))
```

--

Some missing year values are encoded as -9999; replace these with `NA`:

```r
wt_IA <- mutate(wt_IA, p_year = replace(p_year, p_year < 0, NA))
```

---

## Wind Turbines in Iowa

To show the locations of wind turbines on a map, load some map data:

.pull-left[
```r
iowa_sf <- sf::st_as_sf(maps::map("county", "iowa",
                                  plot = FALSE,
                                  fill = TRUE))
p <- ggplot() +
    geom_sf(data = iowa_sf) +
    ggthemes::theme_map()
p
```
]

.pull-right[
<!-- -->
]

---

## Wind Turbines in Iowa

Locations for all wind turbines in Iowa:

.pull-left[
```r
p + geom_point(aes(xlong, ylat), data = wt_IA)
```
]

.pull-right[
<!-- -->
]

---

## Wind Turbines in Iowa

Using color to show when the wind turbines were built:

.pull-left[
```r
year_brk <- c(0, 2005, 2010, 2015, 2020)
year_lab <- c("before 2005", "2005-2009",
              "2010-2014", "2015-2020")
wt_IA <- mutate(wt_IA,
                year = cut(p_year,
                           breaks = year_brk,
                           labels = year_lab,
                           right = FALSE))
p + geom_point(aes(xlong, ylat, color = year),
               data = wt_IA,
               size = 3)
```
]

.pull-right[
<!-- -->
]

---

## Cancer Map

The website <http://www.cancer-rates.info/ia> provides data on cancer
incidence for a number of different cancers in Iowa.
--

<!-- # nolint start -->
<!-- # nolint end -->

The data for lung and bronchus cancer in 2011 are available in a
[csv file](../data/Invasive-Cancer-Incidence-Rates-by-County-in-Iowa-Lung-and-Bronchus-2011.csv)
in the project.

--

We can read the file with `read_csv` from the `readr` package.

--

Looking at the file shows some things that need to be cleaned up:

--

* Two header lines at the beginning.

--

* Some footer lines.

--

* Some values coded as `~`.

---

## Cancer Map

The header can be handled by using `skip = 2` in the `read_csv` call:

<!-- # nolint start -->
```r
fname <- here::here("data/Invasive-Cancer-Incidence-Rates-by-County-in-Iowa-Lung-and-Bronchus-2011.csv")
d <- read_csv(fname, skip = 2)
head(d)
## # A tibble: 6 x 7
##   County  `Population at … Cases `Crude Rate` `Age-adjusted R… `95% Confidence …
##   <chr>              <dbl> <chr> <chr>        <chr>            <chr>
## 1 Union              12570 20    159.11       115.82           69.86
## 2 Ringgo…             5098 11    215.77       115.03           54.48
## 3 Monroe              8044 12    149.18       109.99           56.24
## 4 Page               15926 25    156.98       109.20           70.12
## 5 Montgo…            10655 17    159.55       99.45            57.41
## 6 Adams               3996 6     150.15       95.84            35.14
## # … with 1 more variable: 95% Confidence Interval-Upper Limit <chr>
```
<!-- # nolint end -->

---

## Cancer Map

Let's focus on a few variables and give them more convenient names:

```r
d <- select(d, county = 1, population = 2, count = 3)
```

--

The footer needs to be removed:

```r
tail(d)
## # A tibble: 6 x 3
##   county                                                        population count
##   <chr>                                                              <dbl> <chr>
## 1 Butler                                                             14960 5
## 2 Winneshiek                                                         21045 ~
## 3 STATE                                                            3065223 2368
## 4 Note: All rates are per 100,000. Rates are age-adjusted to t…         NA <NA>
## 5 Rates generated on Jun 12, 2019.                                      NA <NA>
## 6 Based on data released Nov 2017.                                      NA <NA>
```

--

One way to remove the footer:

```r
d <- filter(d, ! is.na(population))
d <- filter(d, county != "STATE")
```

---

## Cancer Map

Changing `count` to numeric changes the `~` entries to missing (`NA`)
values:

```r
d <- mutate(d, count = as.numeric(count))
```

--

In this case there are no zero case values; two ways to check:

.pull-left[
```r
count(d, count == 0)
## # A tibble: 2 x 2
##   `count == 0`     n
##   <lgl>        <int>
## 1 FALSE           95
## 2 NA               4
```
]

--

.pull-right[
```r
any(d$count == 0, na.rm = TRUE)
## [1] FALSE
```
]

--

.pull-left[
It _might_ be reasonable to assume these values were zero, so replace them
with zeros:
]

--

.pull-right[
```r
d <- replace_na(d, list(count = 0))
```
]

---

## Cancer Map

A _choropleth map_ uses color or shading to represent values measured for
different geographic regions.

--

We will need to merge, or _left join_, the cancer data with the map data we
loaded for the wind turbine map.

--

This requires a _key_ on which to match the records in the cancer data and
the map data.

--

For Iowa this can be done with the county name, but some care is needed:

```r
d$county[1]
## [1] "Union"
iowa_sf$ID[1]
## [1] "iowa,adair"
```

--

Fixing case differences and dropping the `iowa,` prefix:

```r
d <- mutate(d, cname = county, county = tolower(county))
iowa_sf <- mutate(iowa_sf, ID = sub("iowa,", "", ID))
iowa_sf <- rename(iowa_sf, county = ID)
```

---

## Cancer Map

Still not quite there:

```r
setdiff(d$county, iowa_sf$county)
## [1] "o'brien"
setdiff(iowa_sf$county, d$county)
## [1] "obrien"
```

Drop the apostrophe in O'Brien:

```r
d <- mutate(d, county = sub("'", "", county))
setdiff(d$county, iowa_sf$county)
## character(0)
setdiff(iowa_sf$county, d$county)
## character(0)
```

---

## Cancer Map

Define a `rate1K` variable as the number of cases per 1000 inhabitants and
left join the cancer data to the map data:

```r
d <- mutate(d, rate1K = 1000 * (count / population))
md <- left_join(iowa_sf, d, "county")
head(md)
## Simple feature collection with 6 features and 5 fields
## Geometry type: MULTIPOLYGON
##
Dimension: XY ## Bounding box: xmin: -95.10526 ymin: 40.60552 xmax: -91.06018 ymax: 43.51041 ## CRS: EPSG:4326 ## county population count cname rate1K geom ## 1 adair 7565 10 Adair 1.3218771 MULTIPOLYGON (((-94.24583 4... ## 2 adams 3996 6 Adams 1.5015015 MULTIPOLYGON (((-94.70992 4... ## 3 allamakee 14204 7 Allamakee 0.4928189 MULTIPOLYGON (((-91.22634 4... ## 4 appanoose 12863 11 Appanoose 0.8551660 MULTIPOLYGON (((-92.63009 4... ## 5 audubon 6019 5 Audubon 0.8307028 MULTIPOLYGON (((-95.10526 4... ## 6 benton 26121 28 Benton 1.0719345 MULTIPOLYGON (((-92.06286 4... ``` --- ## Cancer Map A simple map: .pull-left[ ```r ggplot(md) + geom_sf(aes(fill = rate1K)) ``` ] -- .pull-right[ <!-- --> ] --- ## Cancer Map An improved version: .pull-left[ ```r library(ggthemes) library(viridis) ggplot(md) + geom_sf(aes(fill = rate1K), color = "grey") + scale_fill_viridis(name = "Rate per 1000") + theme_map() ``` ] -- .pull-right[ <!-- --> ] --- ## Cancer Map A simple interactive version using [`plotly`](https://plot.ly/r/): .pull-left[ ```r mdl <- mutate(md, label = paste(cname, round(rate1K, 1), population, sep = "\n")) p <- ggplot(mdl) + geom_sf(aes(fill = rate1K, text = label), color = "grey") + scale_fill_viridis(name = "Rate per 1000") + theme_map() plotly::ggplotly(p, tooltip = "text") ``` ] -- .pull-right[
]

---

## Cancer Map

The [`leaflet`](https://rstudio.github.io/leaflet/) package supports
more sophisticated interactive maps:

<!-- http://rstudio.github.io/leaflet/legends.html-->

.hide-code[
```r
library(leaflet)
pal <- colorNumeric(palette = "viridis", domain = md$rate1K)
lab <- lapply(paste0(md$cname, "<BR>",
                     "Rate: ", round(md$rate1K, 1), "<BR>",
                     "Pop: ", scales::comma(md$population, accuracy = 1)),
              htmltools::HTML)
leaflet(sf::st_transform(md, 4326)) %>%
    addPolygons(weight = 2, color = "grey",
                fillColor = ~ pal(rate1K), fillOpacity = 1,
                highlightOptions = highlightOptions(color = "white",
                                                    weight = 2,
                                                    bringToFront = TRUE),
                label = lab) %>%
    addLegend(pal = pal, values = ~ rate1K, opacity = 1)
```
]

---

## Unemployment Map

The [Local Area Unemployment Statistics page](https://www.bls.gov/lau/)
from the Bureau of Labor Statistics makes available county-level
monthly unemployment data for a 14-month window.

--

The file for February 2020 through March 2021 is available at
<http://www.stat.uiowa.edu/~luke/data/laus/laucntycur14-2020.txt> and
in the project data folder.

--

<div class = "note">
This file is a text file but uses a non-standard separator.

--

It is designed for human readability and uses a comma as a _thousands
separator_ or _grouping mark_.

--

It also includes header and footer information.

--

It is still reasonably easy to read in.
</div>

<!-- foobar - don't ask -->

---

## Unemployment Map

One way to read the data into R is:

```r
lausURL <- here::here("data/laucntycur14-2020.txt")
lausUS <- read.table(lausURL,
                     col.names = c("LAUSAreaCode", "State", "County",
                                   "Title", "Period", "LaborForce",
                                   "Employed", "Unemployed", "UnempRate"),
                     quote = '"', sep = "|", skip = 6,
                     stringsAsFactors = FALSE,
                     strip.white = TRUE, fill = TRUE)
footstart <- grep("------", lausUS$LAUSAreaCode)
lausUS <- lausUS[1 : (footstart - 1), ]
```

---

## Unemployment Map

It may be useful to be able to access the county name and state name
separately:

```r
lausUS <- separate(lausUS, Title, c("cname", "scode"),
                   sep = ", ", fill = "right")
```

--

Check the variable types:

```r
sapply(lausUS, class)
## LAUSAreaCode        State       County        cname        scode       Period
##  "character"    "integer"    "integer"  "character"  "character"  "character"
##   LaborForce     Employed   Unemployed    UnempRate
##  "character"  "character"  "character"  "character"
```

--

The `UnempRate` variable is read as character data because of missing
value encoding, so it needs to be converted to numeric:

```r
lausUS <- mutate(lausUS, UnempRate = as.numeric(UnempRate))
```

---

## Unemployment Map

Check for missing values:

```r
select_if(lausUS, anyNA) %>% names()
## [1] "scode"     "UnempRate"
```

--

The state code is missing for the District of
Columbia:

```r
select(lausUS, cname, scode) %>%
    filter(is.na(scode)) %>%
    unique()
##                  cname scode
## 1 District of Columbia  <NA>
```

--

<!-- Missing values for `UnempRate` are all for Puerto Rico and
September 2017. Hurricane Maria made landfall on September 20. -->

March and April 2020 numbers were not available for Puerto Rico:

```r
select(lausUS, scode, Period, UnempRate) %>%
    filter(is.na(UnempRate)) %>%
    unique()
##    scode Period UnempRate
## 1     PR Mar-20        NA
## 79    PR Apr-20        NA
```

---

## Unemployment Map

To compute the national monthly unemployment rates over this period we
need some more data cleaning:

```r
lausUS <- mutate(lausUS,
                 Period = fct_inorder(Period),
                 LaborForce = as.numeric(gsub(",", "", LaborForce)),
                 Unemployed = as.numeric(gsub(",", "", Unemployed)))
```

---

## Unemployment Map

Unemployment during this period was affected significantly by the
COVID-19 pandemic. A plot shows a large spike in April 2020:

.hide-code[
```r
group_by(lausUS, Period) %>%
    summarize(Unemployed = sum(Unemployed, na.rm = TRUE),
              LaborForce = sum(LaborForce, na.rm = TRUE),
              UnempRate = 100 * (Unemployed / LaborForce)) %>%
    ggplot(aes(Period, UnempRate, group = 1)) + geom_line()
```
<!-- -->
]

---

## Unemployment Map

A choropleth map can be used to look at how the impact was distributed
across the country.

--

To show unemployment rates on a map we need to merge the unemployment
data with map data.

--

To match county unemployment data and county shape data it is safer to
use the numeric
[FIPS county code](https://en.wikipedia.org/wiki/FIPS_county_code).
This can be added with

```r
lausUS <- mutate(lausUS, fips = State * 1000 + County)
```

--

Shape data for US counties can be obtained from a number of sources,
in several different formats.

--

Here is one approach:

```r
counties_sf <- sf::st_as_sf(maps::map("county", plot = FALSE, fill = TRUE))
county.fips <- mutate(maps::county.fips,
                      polyname = sub(":.*", "", polyname)) %>%
    unique()
counties_sf <- left_join(counties_sf, county.fips, c("ID" = "polyname"))
states_sf <- sf::st_as_sf(maps::map("state", plot = FALSE, fill = TRUE))
```

---

## Unemployment Map

Some summaries over the period can be computed as

```r
summaryUS <- group_by(lausUS, County, State, fips) %>%
    summarize(avg_unemp = mean(UnempRate, na.rm = TRUE),
              max_unemp = max(UnempRate, na.rm = TRUE),
              apr_unemp = UnempRate[Period == "Apr-20"]) %>%
    ungroup()
## `summarise()` has grouped output by 'County', 'State'. You can override using the `.groups` argument.
head(summaryUS)
## # A tibble: 6 x 6
##   County State  fips avg_unemp max_unemp apr_unemp
##    <int> <int> <dbl>     <dbl>     <dbl>     <dbl>
## 1      1     1  1001      4.65      10.9      10.9
## 2      1     4  4001     12.7       17.6      16.6
## 3      1     5  5001      4.06       5.2       4.8
## 4      1     6  6001      8.8       14.6      14.6
## 5      1     8  8001      8.3       12.7      12.7
## 6      1     9  9001      8.31      11.7       8.2
```

---

## Unemployment Map

A choropleth map of the April 2020 unemployment rates:

.hide-code[
```r
left_join(counties_sf, summaryUS, "fips") %>%
    ggplot() +
    geom_sf(aes(fill = apr_unemp)) +
    scale_fill_viridis(name = "Rate", na.value = "red") +
    theme_map() +
    geom_sf(data = states_sf, col = "grey", fill = NA)
```
<!-- -->
]

---

## Unemployment Map

Using a very visible color for missing data is useful, at least during
exploration.

--

`anti_join` can show the county geometry that does not have an entry in
the unemployment data:

```r
anti_join(counties_sf, summaryUS, "fips")
## Simple feature collection with 1 feature and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -103.0121 ymin: 42.99475 xmax: -102.0782 ymax: 43.68803
## CRS:           EPSG:4326
##                     ID  fips                           geom
## 1 south dakota,shannon 46113 MULTIPOLYGON (((-102.8115 4...
```

--

Shannon County, SD (FIPS 46113), was renamed to
[Oglala Lakota County](https://en.wikipedia.org/wiki/Oglala_Lakota_County,_South_Dakota)
in June 2015 and given a new FIPS code, 46102.

--

The geometry data table needs to be updated:

```r
counties_sf <- mutate(counties_sf,
                      fips = replace(fips, fips == 46113, 46102))
```

---

## Unemployment Map

With the updated data the map is now complete:

.hide-code[
```r
left_join(counties_sf, summaryUS, "fips") %>%
    ggplot() +
    geom_sf(aes(fill = apr_unemp)) +
    scale_fill_viridis(name = "Rate", na.value = "red") +
    theme_map() +
    geom_sf(data = states_sf, col = "grey", fill = NA)
```
<!-- -->
]

---

## Gapminder Childhood Mortality Data

The `gapminder` package provides a subset of the data from the
[Gapminder](http://www.gapminder.org/) web site.

--

Additional data sets are [available](http://www.gapminder.org/data/).

--

* A data set on childhood mortality is available locally as a
  [csv file](http://homepage.stat.uiowa.edu/~luke/data/gapminder-under5mortality.csv)
  or an
  [Excel file](http://homepage.stat.uiowa.edu/~luke/data/gapminder-under5mortality.xlsx).

--

* The Excel file is also available in the project data folder.

--

* The numbers represent the number of deaths within the first five
  years per 1000 births.

--

<div class="note">
Many researchers like to manage their data in a spreadsheet.

--

* Being able to read such a sheet directly greatly helps keep the
  workflow reproducible.

--

* Many spreadsheets contain headers, footers, and other annotations to
  aid a human viewer.

--

* As long as the data are in a rectangular region it is usually not
  hard to extract them programmatically.
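
--

For example, `readxl` can read just a rectangular cell region by
supplying a `range` argument (the file name and cell range here are
made up for illustration):

```r
library(readxl)

## Read only the rectangular region containing the data, ignoring
## titles, footnotes, and other annotations around it.  The file name
## "annotated-sheet.xlsx" and the range "B3:F120" are hypothetical.
d <- read_excel("annotated-sheet.xlsx", sheet = 1, range = "B3:F120")
```

The `range` argument also accepts sheet-qualified Excel-style ranges
such as `"Sheet1!B3:F120"`.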
</div>

<!-- foobar - don't ask -->

---

## Gapminder Childhood Mortality Data

Loading the data:

```r
library(readxl)
gcm <- read_excel(here::here("data/gapminder-under5mortality.xlsx"))
```

--

A first look:

```r
head(gcm, 3)
## # A tibble: 3 x 217
##   `Under five mortality` `1800.0` `1801.0` `1802.0` `1803.0` `1804.0` `1805.0`
##   <chr>                     <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1 Abkhazia                    NA       NA       NA       NA       NA       NA
## 2 Afghanistan                469.     469.     469.     469.     469.     469.
## 3 Akrotiri and Dhekelia       NA       NA       NA       NA       NA       NA
## # … with 210 more variables: 1806.0 <dbl>, 1807.0 <dbl>, 1808.0 <dbl>,
## #   1809.0 <dbl>, 1810.0 <dbl>, 1811.0 <dbl>, 1812.0 <dbl>, 1813.0 <dbl>,
## #   1814.0 <dbl>, 1815.0 <dbl>, 1816.0 <dbl>, 1817.0 <dbl>, 1818.0 <dbl>,
## #   1819.0 <dbl>, 1820.0 <dbl>, 1821.0 <dbl>, 1822.0 <dbl>, 1823.0 <dbl>,
## #   1824.0 <dbl>, 1825.0 <dbl>, 1826.0 <dbl>, 1827.0 <dbl>, 1828.0 <dbl>,
## #   1829.0 <dbl>, 1830.0 <dbl>, 1831.0 <dbl>, 1832.0 <dbl>, 1833.0 <dbl>,
## #   1834.0 <dbl>, 1835.0 <dbl>, 1836.0 <dbl>, 1837.0 <dbl>, 1838.0 <dbl>,
## #   1839.0 <dbl>, 1840.0 <dbl>, 1841.0 <dbl>, 1842.0 <dbl>, 1843.0 <dbl>,
## #   1844.0 <dbl>, 1845.0 <dbl>, 1846.0 <dbl>, 1847.0 <dbl>, 1848.0 <dbl>,
## #   1849.0 <dbl>, 1850.0 <dbl>, 1851.0 <dbl>, 1852.0 <dbl>, 1853.0 <dbl>,
## #   1854.0 <dbl>, 1855.0 <dbl>, 1856.0 <dbl>, 1857.0 <dbl>, 1858.0 <dbl>,
## #   1859.0 <dbl>, 1860.0 <dbl>, 1861.0 <dbl>, 1862.0 <dbl>, 1863.0 <dbl>,
## #   1864.0 <dbl>, 1865.0 <dbl>, 1866.0 <dbl>, 1867.0 <dbl>, 1868.0 <dbl>,
## #   1869.0 <dbl>, 1870.0 <dbl>, 1871.0 <dbl>, 1872.0 <dbl>, 1873.0 <dbl>,
## #   1874.0 <dbl>, 1875.0 <dbl>, 1876.0 <dbl>, 1877.0 <dbl>, 1878.0 <dbl>,
## #   1879.0 <dbl>, 1880.0 <dbl>, 1881.0 <dbl>, 1882.0 <dbl>, 1883.0 <dbl>,
## #   1884.0 <dbl>, 1885.0 <dbl>, 1886.0 <dbl>, 1887.0 <dbl>, 1888.0 <dbl>,
## #   1889.0 <dbl>, 1890.0 <dbl>, 1891.0 <dbl>, 1892.0 <dbl>, 1893.0 <dbl>,
## #   1894.0 <dbl>, 1895.0 <dbl>, 1896.0 <dbl>, 1897.0 <dbl>, 1898.0 <dbl>,
## #   1899.0 <dbl>, 1900.0 <dbl>, 1901.0 <dbl>, 1902.0 <dbl>, 1903.0 <dbl>,
## #   1904.0 <dbl>, 1905.0 <dbl>, …
```

---

## Gapminder Childhood Mortality Data

This data set is in _wide_ format, with one column per year.

--

A _long_ version with a year and a value column is useful for working
with `ggplot`.

--

A better first variable name:

```r
names(gcm)[1] <- "country"
```

--

Convert to long format:

```r
tgcm <- pivot_longer(gcm, -1, names_to = "year", values_to = "u5mort") %>%
    mutate(year = as.numeric(year))
head(tgcm, 3)
## # A tibble: 3 x 3
##   country   year u5mort
##   <chr>    <dbl>  <dbl>
## 1 Abkhazia  1800     NA
## 2 Abkhazia  1801     NA
## 3 Abkhazia  1802     NA
```

---

## Gapminder Childhood Mortality Data

Some explorations:

.pull-left[
```r
p <- ggplot(tgcm) +
    geom_line(aes(year, u5mort, group = country), alpha = 0.3)
plotly::ggplotly(p)
```
]

.pull-right[
]

---

## Gapminder Childhood Mortality Data

Some selected countries:

.pull-left[
```r
countries <- c("United States", "United Kingdom",
               "Germany", "China", "Egypt")
filter(tgcm, country %in% countries) %>%
    ggplot() +
    geom_line(aes(x = year, y = u5mort, color = country))
```
]

.pull-right[
<!-- -->
]

---

## Gapminder Childhood Mortality Data

Examining the missing values:

.pull-left[
```r
tgcm_miss <- group_by(tgcm, country) %>%
    summarize(anyNA = anyNA(u5mort)) %>%
    filter(anyNA) %>%
    pull(country)
p <- filter(tgcm, country %in% tgcm_miss) %>%
    ggplot(aes(x = year, y = u5mort, group = country)) +
    geom_line(na.rm = TRUE) +
    xlim(c(1940, 2020))
plotly::ggplotly(p)
```
]

.pull-right[
]
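
---

## Gapminder Childhood Mortality Data

The wide-to-long conversion done above with `pivot_longer` can also be
expressed in base R with `reshape`; here is a minimal sketch on a small
made-up wide table (not the actual Gapminder file):

```r
## Toy wide table: one row per country, one column per year.
gw <- data.frame(country = c("A", "B"),
                 `1800` = c(350, 420),
                 `1801` = c(348, 415),
                 check.names = FALSE)

## Stack the year columns into (year, u5mort) pairs, one row per
## country-year combination.
gl <- reshape(gw, direction = "long",
              varying = c("1800", "1801"), v.names = "u5mort",
              times = c(1800, 1801), timevar = "year",
              idvar = "country")
gl <- gl[order(gl$country, gl$year), c("country", "year", "u5mort")]
```

`pivot_longer` is usually clearer for this kind of reshaping, but
`reshape` avoids the `tidyr` dependency.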
<!-- adapted from Emi Tanaka's gist at
     https://gist.github.com/emitanaka/eaa258bb8471c041797ff377704c8505 -->